Extracting Memorized Training Data via Decomposition

arXiv.org Artificial Intelligence

The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to revealing the exact contents of the source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work demonstrated that LLMs may be tricked into divulging training data by using out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs. We use instruction decomposition techniques to incrementally extract fragments of training data. Out of 3723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of verbatim sentences from 6 articles. Our analysis demonstrates that this method successfully induces the LLM to generate texts that are reliable reproductions of news articles, meaning that they likely originate from the source training dataset. This method is simple, generalizable, and does not fine-tune or change the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration at every stage, from model development through end use.
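
The "verbatim sentence" figures in the abstract can be illustrated with a small overlap metric. The sketch below is an assumption about how such a measurement might look, not the paper's actual evaluation code: the function names, the naive regex sentence splitter, and the whitespace normalization are all hypothetical choices made for illustration.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(text):
    """Collapse runs of whitespace so line breaks do not hide a match."""
    return " ".join(text.split())


def verbatim_fraction(article, generated):
    """Return the share of article sentences found verbatim in generated text.

    Hypothetical metric: a sentence counts only if it appears as an exact
    substring of the model output after whitespace normalization.
    """
    sentences = split_sentences(article)
    if not sentences:
        return 0.0
    haystack = normalize(generated)
    hits = sum(1 for s in sentences if normalize(s) in haystack)
    return hits / len(sentences)


if __name__ == "__main__":
    article = "The cat sat. The dog ran. Birds fly south."
    generated = "Sure: The dog ran. Also something else."
    print(verbatim_fraction(article, generated))
```

Under this definition, an article would count toward the "over 20%" group when `verbatim_fraction` exceeds 0.2. A real evaluation would likely need a more robust sentence tokenizer and a decision about punctuation and casing differences.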


Nikola Tesla's Amazing Predictions for the 21st Century

#artificialintelligence

In the 1930s journalists from publications like the New York Times and Time magazine would regularly visit Nikola Tesla at his home on the 20th floor of the Hotel Governor Clinton in Manhattan. There the elderly Tesla would regale them with stories of his early days as an inventor and often opined about what was in store for the future. Last year we looked at Tesla's prediction that eugenics and the forced sterilization of criminals and other supposed undesirables would somehow purify the human race by the year 2100. Today we have more from that particular article which appeared in the February 9, 1935, issue of Liberty magazine. The article is unique because it wasn't conducted as a simple interview like so many of Tesla's other media appearances from this time, but rather is credited as "by Nikola Tesla, as told to George Sylvester Viereck."


Supervised Learning for Document Classification with Scikit-Learn - QuantStart

#artificialintelligence

This is the first article in what will become a set of tutorials on how to carry out natural language document classification, for the purposes of sentiment analysis and, ultimately, automated trade filter or signal generation. This particular article will make use of Support Vector Machines (SVM) to classify text documents into mutually exclusive groups. Since this is the first article written in 2015, I feel it is now time to move on from Python 2.7.x and make use of the latest 3.4.x. Hence all code in this article will be written with 3.4.x in mind. There are a significant number of steps to carry out between viewing a text document on a web site, say, and using its content as an input to an automated trading strategy to generate trade filters or signals. In this particular article we will avoid discussion of how to download multiple articles from external sources and make use of a given dataset that already comes with its own provided labels. This will allow us to concentrate on the implementation of the "classification pipeline", rather than spend a substantial amount of time obtaining and tagging documents. In subsequent articles in this series we will make use of Python libraries, such as Scrapy and BeautifulSoup, to automatically obtain many web-based articles and effectively extract their text-based data from the HTML.
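
A "classification pipeline" of the kind described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the article's own code: the toy documents and labels are invented for the example, and the choice of `TfidfVectorizer` feeding a `LinearSVC` is an assumption about one reasonable way to turn text into SVM inputs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy labeled documents (invented for illustration; a real study would
# use a properly labeled corpus with a held-out test set).
train_docs = [
    "stocks rally as profits rise",
    "shares surge on strong earnings",
    "profits jump and stocks gain",
    "stocks plunge as losses mount",
    "shares fall on weak earnings",
    "losses widen and stocks drop",
]
train_labels = [
    "positive", "positive", "positive",
    "negative", "negative", "negative",
]

# Step 1 converts raw text to TF-IDF feature vectors;
# step 2 fits a linear Support Vector Machine on those vectors.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
classifier.fit(train_docs, train_labels)

# Classify an unseen document into one of the mutually exclusive groups.
predictions = classifier.predict(["shares jump on strong profits"])
print(predictions[0])
```

Wrapping both steps in a `Pipeline` means the same vectorizer vocabulary is applied at training and prediction time, which avoids a common source of feature mismatch.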